Polymer Field-Theory Simulations on Graphics Processing Units 
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We report the first CUDA graphics-processing-unit (GPU) implementation of the polymer field- 
theoretic simulation framework for determining fully fluctuating expectation values of equilibrium 
properties for periodic and select aperiodic polymer systems. Our implementation is suitable both 
for self-consistent field theory (mean-field) solutions of the field equations, and for fully fluctuating 
simulations using the complex Langevin approach. Running on NVIDIA® Tesla T20 series GPUs, 
we find double-precision speedups of up to 30 x compared to single-core serial calculations on a 
recent reference CPU, while single-precision calculations proceed up to 60 x faster than those on the 
single CPU core. Due to intensive communications overhead, an MPI implementation running on 
64 CPU cores remains two times slower than a single GPU. 
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I. INTRODUCTION 

Computational physics has long been a consumer of 
advanced computational resources and a driver for the 
development of cutting-edge hardware. Rapid strides 
in development continue today, and recent installations 
of publicly hosted research supercomputers have begun 
breaking the petaFLOP barrier. However, since single- 
core CPUs hit the power wall over the last decade, ex- 
ploiting modern resources in the physical sciences, which 
tend to deliver problems that are almost uniquely both 
bandwidth and compute limited, requires increasing lev- 
els of software sophistication with multiple layers of care- 
fully designed parallelism. 

A recent flurry of activity has taken place in the 
use of graphics processing units (GPUs) for general- 
purpose computing. These devices are inherently mas- 
sively parallel, with modern GPUs containing hundreds 
of light-weight cores. Single GPUs have recently passed 
the teraFLOP barrier in total throughput for single- 
precision arithmetic. The field that began with manually 
packaging non-graphical computations into the language 
of graphical operations has matured with the release 
of general programming frameworks such as NVIDIA's 
CUDA and OpenCL. The development of these gen- 
eral programming frameworks, combined with the re- 
cent inclusion of hardware double-precision operations, 
has made targeting GPUs an increasingly attractive en- 
deavor for computational scientists, both as low-heat, 
low-expense desktop-size cluster replacements, and as 
high-performance co-processors in distributed cluster ar- 
chitectures. This activity is reflected in the recent uptick 
of publications reporting implementations and perfor- 
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mance metrics of GPU simulation codes in fields as 
diverse as electronic structure theory computational 
fluid dynamics [3, , molecular dynamics f3|, and electro- 
magnetic simulations [1, [^. 

Polymer statistical field theory has proven to be a pow- 
erful theoretical construct for both analytical and numer- 
ical computations on a wide range of equilibrium poly- 
mer systems [7|. Field theories of homogeneous systems 
have been combined with powerful analytical techniques, 
such as the renormalization group, to elucidate impor- 
tant and fundamental phenomena such as the excluded 
volume effect [8,]- Models of inhomogeneous systems, such 
as phase separated polymer alloys and block copolymers, 
have largely succumbed to numerical self-consistent field 
theory (SCFT) , which is a computer simulation method- 
ology based on a mean-field app roximation to the un- 
derlying field theory model^, [ly. At the frontier of the 
field are "field-theoretic simulations" (FTS), which at- 
tempt to stochastically sample the unapproximated (and 
complex valued) field theory, thereby enabling the study 
of fluctuation phenomena in both homogeneous and in- 
homogeneous polymer systems [tI [Til. [T2j . 

In this article, we present details of a GPU implemen- 
tation of the polymer field theory framework, including 
beyond-mean-field simulations, and report favorable run- 
times compared to the same code running both in serial 
and in parallel on CPUs. We have taken care to report 
benchmark timings together with precise models of the 
hardware involved. There are limited reports of GPU 
implementations of other polymer field theory simula- 
tion methods in the literature. Gao and coworkers'!^ re- 
cently reported a self-consistent field theory implementa- 
tion specifically for semi-flexible block copolymers in two 
dimensions, with an observed 5.3 x speedup over their 
CPU code. Wright and Wickham demonstrated a peak 
30 X speedup on GPUs for a simple Landau-Brazovskii 
type free-energy model of diffusive polymer dynamics fl^. 
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The analytic and real-valued free-energy model used in 
the latter avoids the expensive propagator calculations 
that are the hallmark of SCFT and the more involved 
FTS simulations. To our knowledge, there has been no 
prior report in the literature on a GPU implementation 
of a field-theoretic polymer simulation. 

We note a recent critical review Ts'l of reported GPU 
code speedups by Lee et al. While reports of GPU codes 
running high-throughput tasks 100 x to 1000 x faster 
than their CPU counterparts are not uncommon, Lee 
et al. attribute such favorable timings to comparisons 
between carefully optimized GPU code and less well- 
optimized CPU code. For a wide range of commonly 
used core algorithms, they report comparisons of care- 
fully optimized GPU and CPU code, with the latter ex- 
ploiting all of the avenues for on-chip vectorization, ef- 
fective use of large on-chip caches, and appropriate reor- 
ganizations of memory access patterns. They find that 
CPU code can typically match the runtimes of GPU code 
to within an order of magnitude, if particular attention 
is spent optimizing both codes. Thus, we have taken 
care to ensure that our CPU code, which shares much of 
the code base with the GPU implementation, is aggres- 
sively optimized. In particular, we do exploit on-chip 
vectorization in our CPU code, through the use of hand- 
coded streaming SIMD extensions 3 (SSE3) instructions 
for arithmetic on contiguous vectors of complex numbers, 
and through shared-memory parallelism with OpcnMP 
for exploitation of multi-core architectures. We demon- 
strate in Sec.|V]that with such careful optimizations, our 
runtimes on both CPU and GPU are easily dominated 
by the Fourier transformation steps that are handled by 
high-performance fast Fourier transform (FFT) libraries. 
This dominance is not entirely surprising, since multi- 
dimensional FFTs are not trivially parallelizable and are 
the only component of our calculation that does not scale 
linearly with system size. 



II. THEORETICAL APPROACH 

A comprehensive discussion of the theoretical founda- 
tions of our approach can be found elsewhere Q- Here we 
summarize the most important expressions that underpin 
our numerical scheme. 

We begin by specifying a particle-based model, with 
the relevant degrees of freedom corresponding to posi- 
tions of statistical segments of the polymer chains. In- 
teractions between statistical segments are characterized 
by short-ranged intra-molecular interactions (e.g., Gaus- 
sian stretching, backbone stiffness, ...), and short- or 
long-ranged intermolecular interactions (e.g., excluded- 
volume interaction, Flory-like contact potentials, electro- 
static interactions, . . . ). We limit the remainder of our 
discussion, though not necessarily our simulation code, to 
fully flexible Gaussian chains that interact only through 
contact potentials. In order to decouple interactions be- 
tween statistical segments, we introduce auxiliary fluctu- 



ating fields through a Hubbard-Stratonovich transforma- 
tion and integrate out the then independent particle de- 
grees of freedom. The resulting canonical partition func- 
tion is typically of the form 



Zc = 



ZollJ Vl,,cxp{-H[{l,}])., 



(1) 



where Zq is some field-independent constant, ujj (a?) are 
the set of auxiliary "chemical potential" fields introduced 
to decouple interactions, / Vuii denotes a functional inte- 
gral over such a field, and is a complex-valued energy 
functional that contains model-dependent interactions. 
Other statistical ensembles are readily derived. 

The energy functional, H, usually contains both ex- 
plicit and implicit dependences on the fields. For exam- 
ple, the Jildwards model of homopolymers in an implicit 
solvent[16||, which has a single chemical potential field, 
may be written as 
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dx[Lu{x)Y -CVlnQ[iuj], 



(2) 



where _B is a measure of the excluded volume interaction, 
C is the number of polymer chains per unit volume, V is 
the system volume in units of the free-polymer radius of 
gyration, Rg, and Q is the normalized partition function 
discussed below. Similarly, the well-studied case of an 
incompressible melt of diblock copolymer consisting of 
connected blocks of incompatible "A" and "B" segments 
can be written in terms of "pressure" and "exchange" 
fields as0,[i3 



XabN 



-CV\nQ[uA,ujB], 
(3) 

where xab is the Flory contact interaction between sta- 
tistical segments of "A" and "B", and N is the total 
number of statistical segments in the copolymer chain. 
The chemical potential fields lja and ujb are the fields 
felt by "A" and "B" segments respectively, and they are 
related to the pressure and exchange fields (a;+ and a;_ ) 
through the mappings uja = — i-^-, i-jb = + 

Both of these model Hamiltonian functionals contain 
implicit non-linear and non-local dependencies on the 
chemical potential fields through the normalized parti- 
tion function, Q, which is derived from the statistical 
ensemble of a single polymer chain interacting with the 
specified chemical potential field(s). All details of poly- 
mer architecture are embedded within the Q functional. 
While the present discussion is limited to fully flexible 
and monodisperse linear homopolymer or diblock copoly- 
mer chains, the method, and indeed our simulation code, 
is easily extended to more complex architectures such as 
symmetric and fully asymmetric multiblock copolymers, 
branched and star polymer chains, and grafted copoly- 
mers, by simply substituting the appropriate form of Q 
and introducing fields to decouple any extra interactions. 

The intensive part of our simulation method is the de- 
termination of the normalized partition function, Q [{w}]. 
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for a fully specified set of chemical potential fields. We 
write Q in terms of a Feynman-Kac-like propagator, 5, 
as 



Q H = i j dxq (x, s = 1; [{uj}]) , 



(4) 



where s is the polymer backbone contour variable, which 
has been normalized by iV, the length of the polymer, so 
that the chain contour length is 1. g (and, for asymmetric 
chains that are not invariant to reversal of the contour 
direction, its conjugate q^) is determined from a modified 
diffusion equation 



dsQ (f, s; [w]) = [V^ - w (f; s)] q (f, s; [u]) 



(5) 



The s dependence of the chemical potential field, oj, 
is a feature of copolymers that arises from different 
species segments responding to different fields. The most 
common initial condition for Eq. [S] is q (x, s = 0) = 1; 

Qt(x,S=l) = l. 

With the expression for Q formalized, we turn our at- 
tention to the functional integrals over fields in the canon- 
ical partition function, Eq. [T] For many systems that are 
dense and far from phase transitions, the functional inte- 
grals are dominated by saddle-point field configurations 



Zoexp{-H[{u*}]). 



(6) 



The self-consistent-field (SCFT) approach then reduces 
to finding the stationary, saddle-point field configura- 
tions, oj*, for which the negative of the Hamiltonian func- 
tional, —H, is real and large. A necessary condition for 
such a field configuration is for the Hamiltonian to be 
stationary in the sense that its functional derivative with 
respect to any of the fields is zero: 



m 

Slo 



= 0. 



(7) 



For example, in the case of the incompressible diblock 
copolymer melt (Eq. ^ , the mean- field equations are 



6H 



6uj+ 
SH 



C 

= {) = i— {pA {x) + pb [x] - po) , 
Po 

2C C 

= = -uj'L {x) + — {pB (f) - PA {x)) , 

XabN po 



where po is the spatially averaged segment density, and 
PA and Pb are spatially resolved segment densities of "A" 
and "B" components, which emerge from the functional 
derivatives of the normalized partition function, Q, with 
respect to pressure and exchange fields: 

PA ^^^^ ~Q j 9^ 1 ~ ' 

Po 

PB (^) = 7T / dsq (f , s) (x, f - s) , (8) 

^ JfA 



and /a is the fraction of the copolymer chain that is 
composed of "A" segments. The first mean-field equa- 
tion enforces local incompressibility of the melt, while 
the second encourages phase separation of the compo- 
nents — which must be balanced against the loss of chain 
conformational and translational entropy. 

Due to the complex nature of the field theory, care 
must be taken to ensure that the stationary field con- 
figurations satisfying the mean-field equations have the 
correct saddle-point character, in that they correspond 
to local maxima for "pressure-like" fields that enter the 
Hamiltonian as iuj, and to local minima for other fields. 

In order to search for the mean-field configuration 
in an efficient and stable way, we use a dynamical re- 
laxation scheme with appropriate Wick rotations ap- 
plied to pressure-like fields. To move beyond the mean- 
field approximation (SCFT) and affect full FTS sim- 
ulations, we stochastically sample field configurations 
around the saddle-point using the complex Langevin 
(CL) schemelHlili. This scheme ameliorates the sign 
problem deriving from the complex "probability distribu- 
tion" (exp {—H)) that leads to a critical loss of efficiency 
in Monte Carlo schemes. The numerical aspects of these 
methods will be detailed in the following section. 



III. ALGORITHMS 

With a model, and therefore a field-based Hamiltonian, 
defined, field-theoretic simulations require a two-step it- 
erative approach. The outer loop involves updating field 
configurations, either in search of a saddle-point configu- 
ration (dynamical relaxation), or through stochastic ex- 
ploration of the functional integrals (complex Langevin). 
These types of simulation differ only in that the latter 
contains a stochastic driving term in the equation of mo- 
tion of the fields: 



dtu JH , 

9^ = -^&;+^("'')' 



(9) 



where it is understood that here the fields, w, are 
not Wick rotated, and where the final term is omit- 
ted for mean-field SCFT calculations. The stochastic 
driving, 77, is a real, Gaussian noise that is decorre- 
lated in both space and time, and is defined by its 
first and second moments, which are selected to re- 
produce the correct time-averaged distribution using 
the fluctuation-dissipation theorem: {ri{x,t)) — and 
{ri {x, t) r] {x\t')) = 2XS {x - x') S {t - t'). 

In order to solve for the time evolution of the chemical 
potential fields with discrete time stepping of the equa- 
tion of motion (Eq. |9|), we employ recently developed 
stable and accurate integration schemes. For SCFT cal- 
culations, the most important numerical consideration 
is a wide stability window for large time steps. Sta- 
ble numerical schemes admit large time steps and there- 
fore the rapid advancement of the fields to a saddle- 
point configuration, while details of the trajectory taken 
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are less important. For this task, we have found the 
SIS method introduced by Ceniceros and Fredrickson[20j 
to be very efficient. In this approach, the hnearized 
response of the density fields appearing in SH/Slo are 
treated semi-implicitly to confer large gains in stability 
over standard forward-Euler propagation. CL simula- 
tions, on the other hand, require accurate discretization 
schemes for time stepping in order to compute reliable 
time-averaged equilibrium quantities. Achieving accu- 
rate time propagation for simulations in which fluctu- 
ations are large (r; ~ SH/6uj) is not trivial. For this 
task, the recently introduced exponential time differenc- 
ing (ETD) algorithm ^211 [2^ . which incorporates the lin- 
earized force into an integrating factor, has proven to be 
accurate and efficient. Still, with even the most efficient 
algorithms, the emphasis on time-step accuracy, and the 
requirement to collect stochastic averages of operators, 
necessitates that CL calculations in the strong fluctua- 
tion limit usually require one to two orders of magnitude 
more runtime than SCFT calculations, making fluctu- 
ating fleld-theoretic simulations an ideal candidate for 
acceleration with CPUs. 

To further formalize the discrete field-update algo- 
rithms, we discuss in detail only the example of a first- 
order semi-implicit scheme !23|, which reduces to SIS 
in the limit of zero noise. The final discretized time- 
stepping expressions in reciprocal space are: 



Lu — LU 



-XAt 



+ XAtL 



Sid 



t.k 



Tiff 



(10) 



where L implies taking the linear part of the force which, 
in the weak inhomogeneity limit, is typically of the form 

I 



Ap « (J ★ w, where g is a Debye function that depends 
on the polymer architecture. J- is the discrete Fourier 
transform, and f] is the discretized Gaussian noise, which 

has variance {ff'-^ff ^ — '^^A.tSt.t'Ss,s' / AV . Treating 

the linear part of the force semi-implicitly leads to an 
effective wavevector-dependent time step: 



»,fc.t+At _ -.fe,t 

Lu — UJ — 



XAt SH [w*] 
1 + XAtg'' 



(11) 



Any explicitly linear terms in the force expression are 
handled fully implicitly in the time-stepping algorithm, 
which further modifies the latter expression. We note 
here that field updates are entirely local in reciprocal 
space, even accounting for the linearization terms, so that 
the time propagation is insensitive to the choice of sim- 
ulation boundary conditions for conditions compatible 
with plane wave and Fourier bases. We also note that 
the field update equations are order-M (linear scaling 
with grid dimension, M), local, and trivially paralleliz- 
able over the Fourier spatial modes. 

Within the inner loop of a field-theoretic simulation, 
the modified diffusion equation (Eq.[S|) for the chain prop- 
agators, q, must be solved for the static field configu- 
rations so that the normalized partition functions and 
spatially resolved segment densities are available. For 
this task, we have found pseudo-spectrali24i [25j meth- 
ods based on operator splitting to be the most efficient 
schemes. Specifically, we use an extension to the origi- 
nal pseudo-spectral approach that cancels the lowest or- 
der error using Richardson extrapolation, leading to a 
scheme that is globally fourth-order accurate in the step 
size of the discretized contour variable, As ^263. The dis- 
crete solution for a single contour step may be written 



q{x,s + As;[uj]) = -- 



,(-c^(x)As/2) (V^As) (-^(x)As/2) 



4g(-^(5-)A./4)g(V^As/2)g(-.(S)As/2)g(V^A./2)g(-.(x)A./4)l ^ ^. ^ ^ (^^5) ^ (^3) 



where the application of the diffusion operator, 
exp(V^As/2), occurs in Fourier space by forward and 
reverse fast Fourier transform (FFT) operations on the 
propagator. Since the chemical potential field acts lo- 
cally in real space, and the diffusion operator acts locally 
in Fourier space, both operators are diagonal in the rel- 
evant representation, which makes evaluation of the ex- 
ponentiated operators a numerically simple task. As we 
will show elsewhere |27j, this scheme offers the best per- 
formance of our available methods, as measured by the 
wall-clock time taken to reduce the contour discretiza- 
tion error to an arbitrary pre-specifled level (28|. Since 
the modified diffusion equation is semi-local, boundary 
conditions must be specified. Pseudo-spectral schemes 



can be easily developed for Dirichlet, Neumann and pe- 
riodic boundary conditions by using discrete sine, cosine 
or Fourier transforms respectively. A Chebyshev-based 
pseudo-spectral approach is available for more general 
types of boundary conditions [29j. 

IV. STRUCTURE OF THE FIELD-THEORETIC 
SIMULATION CODE 

Our polymer field-theory code was written in C-| — h 
with strict object-oriented design principles and type 
templating. Data is private and strictly encapsulated 
within each class, while an exposed public interface facil- 
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itates manipulation. This strategy allows a strong sep- 
aration between the internal representation of any data 
structures and the operations that can be conducted on 
that data, which guarantees that the code for running on 
GPUs, with its inaccessible memory spaces, or another 
parallel architecture, requires relatively few changes. 

Inheritance is employed throughout the code to ex- 
pose a common interface for objects of classes derived 
from a common parent, e.g., the chain propagators for 
homopolymers, diblock copolymers, asymmetric multi- 
block polymers and other architectures are used identi- 
cally after their creation. This feature provides a flexible 
framework for tackling a large body of field-theory prob- 
lems. 

Our class hierarchy can be broadly divided into low- 
level and high-level classes. Within the low-level par- 
tition are classes that represent device-dependent func- 
tionality, such as an FFTvector container for manipu- 
lating all functions of a single spatial variable that are 
subject to periodic boundary conditions. Through inher- 
itance, these classes are targeted to run on a variety of 
platforms with highly optimized code for GPUs, and for 
CPUs exploiting OpenMP (shared memory) and/or MPI 
(distributed memory) parallelism. The low-level classes 
also include overloaded arithmetic operators to give a 
simple programming model for manipulating fundamen- 
tal objects within the high-level classes. However, for 
optimal efficiency, we avoid binary operators (*, +, . . . ), 
which necessitate creation of temporary objects. Such a 
requirement imposes a high operational overhead when 
temporary objects can be tens or hundreds of megabytes 
in size, and we therefore take care to restructure all steps 
of our algorithms using only compound assignment oper- 
ations (*=, +=, . . . ) and minimal object copies. 

The high-level partition of our code includes classes 
that define the components of various simulation types, 
including solving for the chain propagator, computing the 
normalized partition function and segment density, and 
stepping the discretized equation of motion with various 
"plug-in" field updater algorithms. Since the code imple- 
menting rules of arithmetic are contained entirely within 
the low-level partition, the code in the high-level parti- 
tion is ignorant of the details of the simulation platform. 
Simulations running on a GPU, serial CPU, or paral- 
lel CPUs have identical high-level code, as do simula- 
tions employing single- or double-precision floating-point 
arithmetic. 

Developing code within the high-level partition can 
lead to the introduction of inefficient practices. We avoid 
this problem with careful code design and profiling, en- 
suring that our code is FFT dominated on all platforms, 
and providing for a fair comparison between the runtimes 
for simulations running on GPU or CPU. For example, to 
achieve maximum throughput in solving the discretized 
diffusion equation, all exponentiated field operators used 
in the pseudo-spectral operator-splitting scheme (Eq. [T^ 
are pre-computed when the propagator initial conditions 
are set. Furthermore, specifically for the diffusion op- 



erator, pre-computation may occur just once per simu- 
lation provided the set of plane waves does not change 
(i.e., in the absence of variable-cell methods, which we do 
not consider in the present work). Similarly, the linear 
and linearized force coefficients for the field time stepping 
(Eq. nop are computed only once per simulation. 

A. Serial and Parallel CPU Implementation 

Our simulation code is parallelised over plane waves 
or real-space collocation grid points. For implementing 
the FTS framework, three distinct parallel patterns ap- 
pear: Fourier transformations, direct vector operations, 
and reduction operations. 

Fourier transformations, which are required to solve 
the modified diffusion equation pseudo-spectrally, and 
to make SIS and ETD field updates local, are han- 
dled in CPU-targetted code by the freely available FFTW 
library [13, which has serial and both SIMD and MIMD 
parallel execution capabilities. We follow established 
conventions for data distribution with FFTW. We com- 
piled the FFTW library with SSE extensions, and thor- 
oughly tested the performance of the library with respect 
to various compiler optimizations for each target plat- 
form. For simulations running in parallel, three differ- 
ent strategies are available: shared-memory parallelism 
with OpenMP [H, distributed-memory parallelism with 
explicit message passing using MPI, and a combination 
of the two. With a dual-parallel execution model, it is 
important to be sure that CPU cores do not become over- 
subscribed by the launch of more threads than there are 
available cores. 

Direct vector operations, used for example to apply 
exponentiated operators in the pseudo-spectral scheme, 
or to time-step fields once "forces" are fully determined, 
are trivially parallelisable, requiring no thread or process 
cooperation. 

Reduction operations are used throughout our code to 
perform spatial integrations (e.g., in evaluating the nor- 
malized partition function, Eq. 21 or operators such as 
the energy functional, Eqns. [2] and [3]), or for evaluating 
any type of norm of a field or function. We implement 
reduction operations within MPI using MPI_Allreduce 
operations, and within OpenMP using private variables 
for thread-level block reductions, followed by atomic up- 
dates of a shared result variable. 

We use version 12 of the Intel C++ compiler wrapped 
with OpenMPI for compiling our CPU-target code. 



B. General-Purpose Computing on Graphics 
Processing Units 

GPU architectures are optimized for high throughput 
computing rather than peak single-thread performance. 
They consist of a collection of symmetric multiprocess- 
ing units, each of which has a number of thread proces- 



6 



sors with distinct arithmetic logic units. For example, 
NVIDIA T20 series has 14 multiprocessors with 32 cores 
per processor, for a total of 448 cores. GPU cores are rel- 
atively lightweight with simple instruction sets, and the 
hardware lacks many performance aids present in mod- 
ern CPUs, such as multiple pipelines and large on-chip 
caches. However, thread support is implemented in hard- 
ware, so that thread creation, destruction, switching and 
synchronization are low-overhead operations. 

In principle, code can be written to target simultane- 
ous GPU and CPU execution with load sharing. How- 
ever, in most cases this would inevitably require explicit 
data transfer through the PCI-express bus, which im- 
poses a high penalty for simulation codes handling large 
data sets. For field theoretic simulations, we have yet to 
find any performance benefit in such load sharing. 

While modern GPUs have a large store of shared 
device-level DRAM (3-6 GB on NVIDIA Tesla T20 se- 
ries), memory accesses are relatively slow with high la- 
tency. This dictates a powerful strategy for achieving 
high performance: the GPU should be overloaded with 
threads. Threads that are delayed due to memory la- 
tency are switched out by a scheduler in favor of those 
that are ready to run. The switching overhead is negligi- 
ble, so in this way the memory latency can be "hidden" 
by overloading the GPU. 

Threads are launched in groups, called blocks, of size 
determined by code implementation. Efficient thread 
communication occurs only within blocks by individual 
threads loading and storing of data in a small reserve of 
on-chip shared memory. The shared memory is usually 
very limited, typically less than 64KB, but accesses are 
very fast. Hence, in addition to being used for thread 
cooperation, the shared memory space can be used as a 
temporary store of data fetched from global device mem- 
ory, to avoid subsequent repeat reads, and an important 
aspect of code optimization is the careful use of this lim- 
ited resource. The blocking of threads imposes an impor- 
tant constraint for translating algorithms into GPU code: 
the order of block execution is undefined due to dynamic 
thread scheduling, so that cooperation between threads 
in different blocks is difficult and inefficient. For best 
performance and robust code, the computation should 
not depend on the order in which blocks are executed. 

At any given snapshot in time, threads will be running 
in multiples of the warp size (32), regardless of the block 
size. Threads that are executing the same instructions 
concurrently will benefit with improved performance, as 
will threads that read neighboring memory addresses in 
device memory. Hence, minimizing thread "divergence" , 
in terms of executed instructions and memory access pat- 
terns, yields optimal computational throughput. 

The general programming frameworks for GPUs that 
have recently been developed, CUDA and OpenCL, pro- 
vide a small set of extensions to existing programming 
languages. Host (CPU) code is compiled and executed 
as before, but extensions are included for orchestrating 
the launch of GPU "kernels" from the host code. With 



the release of these frameworks, GPU programming is 
significantly simplified, and the development of efficient 
code can be achieved by noting the small set of consid- 
erations listed above. 



C. CUDA implementation of Field-Theoretic 
Simulations 

Given the design and optimization constraints out- 
lined in the previous section, two crucial design principles 
emerge. Efficient GPU use requires many threads to hide 
memory latency, each of which may be created and de- 
stroyed for the sole purpose of executing a single arith- 
metic operation on a single element of data. Further- 
more, data transfers between host system memory and 
GPU device memory should be infrequent. For these rea- 
sons, our field-theoretic code is fundamentally designed 
to permanently store all quantities on the GPU device 
memory. Transfers of data to host system memory only 
occur for the infrequent output of instantaneous opera- 
tor values and density-field snapshots. In addition, where 
possible, each thread during the execution of GPU ker- 
nels handles only a single plane-wave coefficient or real- 
space collocation sample of a function. 

For the parallel patterns discussed in Sec. IIV Al we 
have written efficient CUDA kernels for parallelising the 
operations over individual spatial samples. The excep- 
tion is the Fourier transformations, which are handled 
by an efficient library, cuFFT, included with the CUDA 
software development kit. Thread computations within 
this library are by far the dominant part of our simulation 
runtime. 

Complex Langevin calculations require generation of 
the Gaussian noise data for every field update (see 
Eq. fTU]) . The normally distributed noise can be gener- 
ated with thread- wise parallelism on the GPU using the 
CUDA "CURAND" library. The net result on runtimes 
is an overall ^ 10% increase in performance for CL sim- 
ulations, compared with running the simulation on the 
GPU but generating the Gaussian noise in serial on the 
CPU. 

Considering the vastly increased, but still limited, 
global memory size of 3-6 GB of DRAM available on 
modern GPUs, a brief analysis of the memory costs of an 
FTS calculation is in order. Taking the example of a di- 
block copolymer melt that contains only a single distinct 
forward-chain propagator, we assume 100 contour steps 
(As) are required to achieve well-converged results for 
Eq. [T^l [13] Assuming all spatially dependent functions 
are resolved with 128 plane waves in each of three spa- 
tial dimensions, corresponding to a large calculation, a 
single function of space represented with 128-bit double- 
precision complex numbers requires 32MB of storage. 
Thus, the chain propagator requires ~ 3 GB of data. 
Upon introduction of the conjugate propagator, which is 
the same size, we already exceed the storage available on 
the largest GPUs. Each field and exponentiated opera- 
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tor, of which there are relatively few, also requires 32 MB. 
To save memory, we do not store q and separately, 
but rather the quantity q{s) [1 ~ s). This quantity is 
computed and stored on-the-fly during the diffusion com- 
putation, so that the individual forward and conjugate 
propagators are never required to be stored individually. 
Provided the propagators are only used to calculate the 
partition function and the density field, this quantity is 
all that is required, which allows for a 50% reduction in 
memory cost. (Note that other properties, such as the 
osmotic pressure or stress tensor, are more complicated 
functionals of the chain propagators, and this memory- 
saving measure cannot be employed when such quantities 
are required). With this choice, almost all present simu- 
lations of interest can fit comfortably into 6 GB of global 
storage. However, for the rare case that the GPU device 
memory is exhausted, we have implemented the option to 
move the quantity g (s) (1 — s) into host system mem- 
ory as it is being computed. This saves > 95% of the 
memory cost of a typical calculation, at the expense of 
some performance loss. 



V. RESULTS AND BENCHMARKING 

For our performance testing, we compare code run- 
times on late-model CPUs and GPUs deployed in the 
knot cluster at the California Nanosystems Institute at 
the University of California, Santa Barbara. The CPUs 
used in our tests were Intel® six-core Xeon E5650 pro- 
cessors clocked at 2.66 GHz. Each node of the cluster has 
two such six-core CPUs installed on the motherboard, for 
a total shared-memory parallel capacity of 12 cores. 115 
nodes are connected in parallel using fast Infiniband in- 
terconnects (MT26438 from Mellanox Technologies) for 
a total parallel capacity of 1380 cores. 

For the tests conducted purely with MPI-level par- 
allelism, MPI processes can execute concurrently either 
on cores within the same CPU/node, or on CPU cores 
located in different nodes. In either case the parallel 
cooperation proceeds through explicit message passing, 
regardless of whether the cooperating cores have access 
to a common shared memory pool. For MPI computa- 
tions involving more than 12 processes, the simulation 
inevitably must execute on multiple nodes with message 
passing through the Infiniband interconnects, but with 
few processes we find that it is advantageous to run on a 
single node. This indicates that competition between the 
MPI processes for memory bandwidth provides less of a 
overhead than message passing between nodes. In con- 
trast, with OpenMP parallelism execution threads may 
only cooperate through a shared memory pool (typically 
24 gigabytes on knot) and such simulations are therefore 
limited to 12 cores the knot cluster. Finally, for mixed 
OpenMP+MPI parallelism, we exploit shared-memory coop- 
eration within a node and explicit message passing for co- 
operation between nodes. In this case, each node handles 
two MPI process (one for each multi-core CPU), which 



are responsible for orchestrating branching and merging 
of multiple OpenMP threads (up to the maximum of 12 
per node). 

The GPUs used in our tests were NVIDIA Tesla C2050 
boards based on the Fermi architecture with 3GB global 
DRAM. We turn off the error-correcting code (ECC) fea- 
ture on the global memory storage of the GPU, leading to 
faster memory transactions. We observe an overall 10% 
performance boost with ECC turned off. Unless other- 
wise stated, all of our timings are for simulations using 
double-precision arithmetic. 

Fig. [T] shows timing data for self-consistent field calcu- 
lations of the lamellar (LAM) phase of an incompressible 
melt of diblock copolymer. For this case, the calculations 
were run either on a single CPU core (serial execution, 
albeit with some on-core vectorization in the form of SSE 
instructions) or a single GPU (threaded parallel execu- 
tion) . Panel (a) shows absolute timing data with respect 
to the total number of plane waves used to represent the 
field for ID, 2D and 3D simulations. The GPU timings 
are clearly significantly shorter than simulations running 
on a single CPU core for all but the smallest of calcu- 
lations, which suffer performance penalties due to not 
being able to overload the GPU with sufficient threads 
to hide memory latency. The GPU simulations require 
^ 1 sec. for a single field update, even for more than one 
million plane waves. We also find that the GPU and 
CPU performance does not depend significantly on the 
dimensionality of space. 

Fig. [IJb) shows the relative speedup of the GPU 
code over the serially executed CPU code, defined as 
7cpui/?GPU- The performance gap widens with increas- 
ing simulation size, and the peak speedup we observe is 
greater than 30 x . Our largest calculation was a 3D sim- 
ulation with 128 plane waves per dimension, for a total of 
2, 097, 152 spatial degrees of freedom. Larger simulations 
are possible, but they become increasingly challenging 
from the perspective of obtaining accurate timing statis- 
tics for single-core serial CPU references due to the long 
runtimes involved. 

To demonstrate that the favorable GPU scaling we 
observed in the previous tests does not derive from a 
fortuitous benefit of phases such as LAM with transla- 
tional invariance in two of the spatial dimensions, we 
reran benchmarking tests on the gyroid phase of the in- 
compressible diblock copolymer melt[3l|. This phase, 
shown in Fig. [2l has no unrestricted translational invari- 
ance and cannot be represented in fewer than three spa- 
tial dimensions. The results shown in Fig. |3] expose a 
similar > 20x peak speedup over a single CPU core as 
found for simulations in the LAM phase. In this case, 
we also compare to parallel CPU simulations conducted 
both with shared-memory (OpenMP) parallelism, and 
with distributed-memory (MPI) parallelism. We note 
that when all MPI processes are launched on a single 
shared-memory node when possible, the performance of 
our MPI implementation is comparable to that of our 
OpenMP implementation, indicating a low overhead of 
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FIG. 1. (Color online) Benchmarking data for SCFT calculation of LAM phase in ID, 2D, and 3D. In all simulations /a ~ 0.5 
and xabN = 14.0 in a cubic cell with side length 10.89 Rg, and fields were initially seeded with decorrelated random noise, (a) 
Absolute timings in seconds per field update (propagator calculation + field stepping). The CPU code is executed in serial on 
a single CPU core with on-core SSE vectorizations enabled. Our GPU code can perform one full field update in 1 sec. for 
even very large basis sets corresponding to 128^ plane waves, (b) Relative speedup of GPU run over a single CPU core. All 
timings are averaged over three runs, with each simulation 1000 field updates in length. Lines are a guide to the eye, and are 
not indicative of runtime or scaling for non-power-two simulation grids. See body text for hardware details. 




FIG. 2. (Color online) Volumetric plot of the minority- 
component density of a diblock copolymer melt in the in 
the gyroid phase. The chosen Flory interaction parameter 
is, XabN = 14.4, and minority block fraction is /a ~ 0.4. 
The simulation domain is of size 8.82 Rg in each dimension. 
Timings for this simulation are shown in Fig. O 



explicit message passing in this limit. Presumably a re- 
sult of the high communications overhead of the trans- 
posing steps taken in distributed-memory implementa- 
tions of multi-dimensional FFTs, we find an MPI per- 
formance that saturates at ^ 10 x speedup even for 64 
CPUs. The result is that a single GPU outperforms even 
one hundred CPUs cooperating via MPI (not shown) for 
our FFT-dominated method. 

As discussed in Sec. IIV CI for the rare simulations 
that exceed the GPU global shared memory capacity 



we have implemented the option to stage the quantity 
g (s) (1 — s) on-the-fly into host memory following each 
contour step in solving the discretized modified diffusion 
equation, Eq. [121 Subsequently, the density fields are 
calculated in serial on a single CPU core through inte- 
grations over the contour variable (e.g., Eq.|5]). With this 
option, the memory allocation on the GPU is limited to a 
small number of individual fields and operators, while the 
much larger s-dependent propagators are never stored lo- 
cally. While this option typically saves more than 95% 
of the memory, depending on the underlying model, it 
also incurs some performance penalty both due to loss of 
parallelism on performing the contour integrals to eval- 
uate the density, and from the cost of transferring giga- 
bytes of data to host memory during the diffusion con- 
tour stepping. Fig. 2] shows the performance loss from 
this memory-saving option, which amounts to approxi- 
mately 50% loss of the peak acceleration obtained from 
running entirely on a GPU. 

Due to the reliance of our simulation code on type 
templating, switching arithmetic from double-precision 
(64-bit fioating point numbers) to single-precision (32-bit 
floats) is a simple task. Before the launch of the Fermi 
architecture, NVIDIA CPUs did not have hardware sup- 
port for double-precision arithmetic, and even the latest 
hardware suffers a FLOP throughput reduction of a fac- 
tor of two when performing double-precision arithmetic. 
Hence, the ability to run calculations in single precision 
is desirable both to support older hardware, and to fur- 
ther improve performance. While single-precision arith- 
metic carries an associated increase in numerical error, 
which may accumulate during the solution of our non- 
linear differential equations, the advanced stable algo- 
rithms that we use sufficiently suppresses serious error 
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FIG. 3. (Color online) 3D SOFT simulation of the cubic gyroid phase of an incompressible diblock copolymer melt. In all 
simulations, /a ~ 0.4, xabN = 14.4 and the simulation cell is a cube with side 8.82 Rg~^. (a) Absolute timings in seconds 
per field iteration (diffusion + field stepping), (b) Relative speedup of GPU and parallel CPU implementations over a single 
CPU. TpLL refers to the timing on any of the parallel platforms, including GPU. All timings are averaged over three runs, 
with each simulation 1000 field updates in length. Lines are a guide to the eye, and are not indicative of runtime or scaling for 
non-power-two simulation grids. See body text for hardware details. 
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FIG. 4. GPU calculation speedup over a single CPU core for 
the ID LAM phase shown in Fig. [1] Comparison of full sim- 
ulation conducted on the GPU to a calculation in which the 
density is calculated using the CPU. The latter option signifi- 
cantly saves GPU device memory with some performance loss 
due to data transfer overhead and loss of parallelism. 
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FIG. 5. Relative speedup of GPU calculations conduction 
with single-precision floating-point arithmetic (32 bits per 
float) proceed approximately two times faster than those us- 
ing double precision. The reference timings are single-core se- 
rial CPU calculations using double-precision arithmetic. Nu- 
merical and model details of the LAM and GYR phase cal- 
culations shown here are the same of those used in Figures [1] 
and [3] respectively. 



accumulation and consequent destabilization. The util- 
ity of single-precision arithmetic in SCFT simulations 
is obvious for rapidly evolving to the saddle point so- 
lution of the mean-field equations. Further refinement 
with double-precision processing can be performed if de- 
sired once the field iteration has converged in single pre- 
cision. The runtime speedup for LAM and GYR phases 
using single-precision arithmetic on a Tesla C2050 GPU 
are shown in Fig. El with the reference runtime being the 
double-precision timings for serial CPU execution. We 
note that the CPU-only execution is not significantly ac- 
celerated when running with single-precision arithmetic, 



despite the use of SSE3 instructions. 

Fig. [S] shows a comparison of serial, MPI and GPU 
runs of a complex Langevin simulation of the cu- 
bic gyroid phase for relatively strong field fluctuations 
(determined 12] for the model of interest by the chain- 
density parameter C — 50). For this case, we plot the 
dynamical trace of the real part of the Hamiltonian op- 
erator. Once equilibration of the CL such trajectories 
has been achieved, time averages correspond to weighted 
averages over field configurations, and therefore to ex- 
pectation values of the averaged operator. Note that 
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FIG. 6. CL dynamical trajectory of (H) plotted against 
wall-clock runtime for the simulation of a single cubic unit 
cell of the gyroid phase of an incompressible diblock copoly- 
mer melt. The simulation parameters are the same as for the 
mean-field calculations of the same phase shown in Fig. O 
For this example, we choose to represent fields with a grid 
of 64'^ plane waves. The relative strength of fluctuations is 
determined by C, which we set to 50 for this simulation. Op- 
erator values were saved after every 100 field updates. Single- 
precision (SP) GPU calculations progress approximately two 
times faster than double precision (DP) equivalents. 



while the average of the Hamiltonian operator, (H), is 
loosely linked to the Helmholtz free energy [S^l, these 
quantities are not equal because (H) does not include 
all entropic contributions from field fluctuations. How- 
ever, this quantity remains a useful measure of equili- 
bration of the CL trajectory. Achieving fast equilibra- 
tion of CL trajectories for large 3D simulations is a ma- 
jor challenge, and one that is only partially resolved by 
improved algorithms such as ETD. Fig. |6] demonstrates 
that the 20-30 x performance gain provided by running 
on CPUs has a significant impact on the our ability to 
quickly move through the warmup stage of our simula- 
tions, with warmup achieved more than two times faster 
than 64 MPI processes running on 64 CPU cores. Single- 
precision arithmetic can be used to further accelerate the 
warmup process by an additional factor of two. In the ex- 
ample shown, the operator averages in single- and double- 
precision match to with statistical error, though even if 
this were not the case, the equilibriated simulation could 
be continued accurately by switching to double precision 
after the warmup stage. Note that the simulation cell 
used in this example is the smallest possible for the cubic 
gyroid phase, and fully capturing the effects of spatially 
decorrelated fluctuations may require larger simulation 
cells, which can require days to reach equilibrium with 
traditional MPI codes. 

Finally, we analyze the flat proflle of serial single-core 
CPU and GPU runtimes shown in Fig. [Tfa) and (b) re- 
spectively for the gyroid calculations presented in Fig. |31 
The proflle data demonstrates that the GPU runtime 



is dominated by fast Fourier transformation operations, 
with approximately 70% of runtime consumed by this 
single activity. The remaining largest consumers of run- 
time are element-by-element products of fleld data for 
complex-complex multiplies (Z * Z) or complex-real mul- 
tiplies {Z * D). These three operations collectively ac- 
count for > 90% of runtime on all simulation sizes. The 
latter two, involving multiplies within CUDA kernels, are 
strongly bandwidth limited because multiple read/write 
operations are required for each arithmetic operation. 
We note that the design strategy of restricting all data 
to the GPU device memory, and not utilizing the host 
storage, results in less than 5% of runtime being spent 
on CPU code and memory copies. We find similar pro- 
file data for our CPU-executed code, with the exception 
that FFT costs reduce to only ~ 50% for the smallest of 
system sizes. 



VI. CONCLUSIONS 

We have demonstrated that field-theoretic simulations 
running on a recent model graphics processing unit can 
progress up to sixty times faster than those running in 
serial on a single core of a recent model CPU. This en- 
hanced performance was observed in simulations con- 
ducted in the mean-field (SCFT) limiting case, as well 
as in complex Langevin simulations of the fully fluc- 
tuating fleld theory. Due to the intensive communica- 
tions overhead associated with fast Fourier transforms, 
the same code running on CPUs in parallel, either with 
shared or distributed memory strategies, saturates with 
approximately ten times speedup over a serial calcula- 
tion. Hence, a single GPU currently provides the highest 
throughput capacity for difficult fleld-theory problems, 
even when compared to the largest CPU-based clusters. 
The performance gains that we observe make fully fluc- 
tuating, large-scale 3D simulations of complex phases 
much more tractable. Since GPU technology is evolv- 
ing rapidly, and compute throughput is increasing signif- 
icantly with each generation, we anticipate that future 
hardware developments or enhancements to the CUDA 
numerical libraries will provide further signiflcant "drop- 
in" gains in performance with little or no extra code de- 
velopment. 

While our peak GPU performance is very favorable, we 
note that small calculations do not beneflt from running 
on a GPU. In such cases, it is not possible to provide 
sufflcient threads to hide memory latency. We flnd the 
break-even point to be somewhere between 512 and 2048 
total plane waves, which is a smaller basis size than is re- 
quired for almost any problem of interest involving large 
simulation cells. 

Strategies for further improvements in performance 
could include sharing the computation load between CPU 
and GPU, identifying strategies for exploitation of asyn- 
chronous GPU kernels, and adding additional levels of 
parallelism so that calculations can run simultaneously 
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on multiple GPUs. However, due to slow memory trans- 
fers and high communications overhead, it is unclear 
whether these strategies would provide significant further 
gains in performance. 
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